Locality Sensitive Hashing

Question 1

The edit distance is the minimum number of character insertions and character deletions required to turn one string into another. Compute the edit distance between each pair of the strings he, she, his, and hers. Then, identify the pairs at each edit distance.



In [1]:

    
from collections import defaultdict
from itertools import combinations



In [2]:

    
def lcs(a, b):
    lengths = [[0 for j in range(len(b)+1)] for i in range(len(a)+1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                lengths[i+1][j+1] = lengths[i][j] + 1
            else:
                lengths[i+1][j+1] = max(lengths[i+1][j], lengths[i][j+1])
    # read the substring out from the matrix
    result = ""
    x, y = len(a), len(b)
    while x != 0 and y != 0:
        if lengths[x][y] == lengths[x-1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y-1]:
            y -= 1
        else:
            assert a[x-1] == b[y-1]
            result = a[x-1] + result
            x -= 1
            y -= 1
    return result



In [3]:

    
strings = ["he","she","his","hers"]
pairs =  list(combinations(strings,2))

length = defaultdict(list)
for pair in pairs:
    x = pair[0]
    y = pair[1]
    dist = len(x) + len(y) - 2 * (len(lcs(x,y)))
    length[dist].append((x,y))

for k,v in length.items():
    print k,v









    



1 [('he', 'she')]
2 [('he', 'hers')]
3 [('he', 'his'), ('she', 'hers'), ('his', 'hers')]
4 [('she', 'his')]

Question 2

Consider the following matrix:

	C1	C2	C3	C4
R1	0	1	1	0
R2	1	0	1	1
R3	0	1	0	1
R4	0	0	1	0
R5	1	0	1	0
R6	0	1	0	0

Perform a minhashing of the data, with the order of rows: R4, R6, R1, R3, R5, R2. Which of the following is the correct minhash value of the stated column? Note: we give the minhash value in terms of the original name of the row, rather than the order of the row in the permutation. These two schemes are equivalent, since we only care whether hash values for two columns are equal, not what their actual values are.



In [4]:

    
import numpy as np

def convertToRows(minhash_row, order):
    the_rows = []

    for item in minhash_row:
        the_rows.append(order[item - 1])

    return the_rows

def minhash(matrix, order):
    m = matrix.shape[1]
    minhash_row = [0] * m
    i = 1

    for r in order:
        row = matrix[r - 1]
        for c in range(0, m):
            if minhash_row[c] == 0:
                minhash_row[c] = i * row[c]
        i += 1
        if 0 not in minhash_row:
            break

    return minhash_row



In [5]:

    
matrix = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 0]])
order = [4, 6, 1, 3, 5, 2]
minhash_row = minhash(matrix, order)
rows_that_contributed = convertToRows(minhash_row, order)
print(rows_that_contributed)









    



[5, 6, 4, 3]

Question 3

Consider a matrix representing the signatures of seven columns, C1 through C7:

C1	C2	C3	C4	C5	C6	C7
1	2	1	1	2	5	4
2	3	4	2	3	2	2
3	1	2	3	1	3	2
4	1	3	1	2	4	4
5	2	5	1	1	5	1
6	1	6	4	1	1	4

Suppose we use locality-sensitive hashing with three bands of two rows each. Assume there are enough buckets available that the hash function for each band can be the identity function (i.e., columns hash to the same bucket if and only if they are identical in the band). Find all the candidate pairs.



In [6]:

    
def candPair(x, y):
    i = 0
    while i < len(x):
        j = i + 1
        while j < len(x):
            if x[i] == x [j] and y[i] == y[j]:
                print "C%s,C%s" % (i+1,j+1)
            j += 1
        i += 1

print "Candidate pairs:"
candPair([1,2,1,1,2,5,4], [2,3,4,2,3,2,2])
candPair([3,1,2,3,1,3,2], [4,1,3,1,2,4,4])
candPair([5,2,5,1,1,5,1], [6,1,6,4,1,1,4])









    



Candidate pairs:
C1,C4
C2,C5
C1,C6
C1,C3
C4,C7

Question 4

Find the set of 2-shingles for the "document": ABRACADABRA and also for the "document": BRICABRAC. Answer the following questions:

How many 2-shingles does ABRACADABRA have?
How many 2-shingles does BRICABRAC have?
How many 2-shingles do they have in common?
What is the Jaccard similarity between the two documents"?



In [7]:

    
str1 = "ABRACADABRA"
str2 = "BRICABRAC"

def getShingles(str):
    shingles = []
    doc_shingles = []
    for i in range(len(str)-1):
        shingles.append(str[i]+str[i+1])
    return set(shingles)
        
def getCommon(list1,list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.intersection(s2))

def getTotal(list1,list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.union(s2))



In [8]:

    
print "Set of 2-shinlges in ABRACADABRA =",  getShingles(str1)
print "Set of 2-shinlges in BRICABRAC =", getShingles(str2)
print
print "Number of 2-shinlges in ABRACADABRA = ", len(getShingles(str1))
print "Number of 2-shinlges in BRICABRAC = ", len(getShingles(str2))
print
print "Number of Common Shingles = ", getCommon(getShingles(str1),getShingles(str2))
print "Total Number of Shingles = ", getTotal(getShingles(str1),getShingles(str2))









    



Set of 2-shinlges in ABRACADABRA = set(['AC', 'AB', 'AD', 'CA', 'DA', 'RA', 'BR'])
Set of 2-shinlges in BRICABRAC = set(['AC', 'AB', 'CA', 'RA', 'BR', 'IC', 'RI'])

Number of 2-shinlges in ABRACADABRA =  7
Number of 2-shinlges in BRICABRAC =  7

Number of Common Shingles =  5
Total Number of Shingles =  9

Question 5

How many distinct 3-shingles are there in the string "hello world" (excluding the quotes)?



In [9]:

    
str1 = "hello world"

def shingle(str):
    s = set()
    for i,c in enumerate(str):
        if i < len(str) - 2:
            shing = c+str[i+1]+str[i+2]
            print shing
            s.add(shing)
    print
    print s
    return s

shingle1 = shingle(str1)
print 
print('Number of 3-shinlges in "hello world" = '+str(len(shingle1)));









    



hel
ell
llo
lo 
o w
 wo
wor
orl
rld

set(['lo ', 'ell', ' wo', 'wor', 'orl', 'o w', 'llo', 'rld', 'hel'])

Number of 3-shinlges in "hello world" = 9

Question 6

Suppose we want to assign points to whichever of the points (0,0) or (100,40) is nearer. Depending on whether we use the L1 or L2 norm, a point (x,y) could be clustered with a different one of these two points. For this problem, you should work out the conditions under which a point will be assigned to (0,0) when the L1 norm is used, but assigned to (100,40) when the L2 norm is used. Identify one of those points from the list: (50,18), (53,15), (56,15), (52,13).



In [10]:

    
def distance(pt1, pt2, norm=2):
    differences = sum(abs(coord1-coord2)**norm for coord1, coord2 in zip(pt1,pt2))
    return np.power(differences, 1./norm)

for pt1 in ((50,18), (53,15), (56,15), (52,13)):
    print pt1
    for norm in range(1,3):
        pt2 = ((0,0), (100,40))
        one = distance(pt1, pt2[0], norm=norm)
        two = distance(pt1, pt2[1], norm=norm)
        if one < two:
            print pt1, "assigned to", pt2[0], "under L{} norm".format(norm)
        else:
            print pt1, "assigned to", pt2[1], "under L{} norm".format(norm)
    print









    



(50, 18)
(50, 18) assigned to (0, 0) under L1 norm
(50, 18) assigned to (0, 0) under L2 norm

(53, 15)
(53, 15) assigned to (0, 0) under L1 norm
(53, 15) assigned to (100, 40) under L2 norm

(56, 15)
(56, 15) assigned to (100, 40) under L1 norm
(56, 15) assigned to (100, 40) under L2 norm

(52, 13)
(52, 13) assigned to (0, 0) under L1 norm
(52, 13) assigned to (0, 0) under L2 norm

Question 7

Suppose we have an LSH family h of (d1,d2,.6,.4) hash functions. We can use three functions from h and the AND-construction to form a (d1,d2,w,x) family, and we can use two functions from h and the OR-construction to form a (d1,d2,y,z) family. Calculate w, x, y, and z.



In [11]:

    
#import numpy as np

# LSH family h (d1, d2, p1, p2) = (d1, d2, 0.6, 0.4)
p1 = 0.6
p2 = 0.4

# We can use three functions from h and the AND-construction to form a (d1,d2,w,x) family
# we cube the probabilities associated with h
# (d1, d2, (p1)^3, (p2)^3)
def AND_hash(p):
    return p ** 3
#w = np.power(p1, 3)
#x = np.power(p2, 3)

# We can use two functions from h and the OR-construction to form a (d1,d2,y,z) family
# we take each probability associated with h, subtract it from 1, 
# square the result, and subtract that from 1
def OR_hash(p):
    return 1 - (1-p) ** 2
#y = 1 - np.power(1 - p1, 2)
#z = 1 - np.power(1 - p2, 2)



In [12]:

    
print "w =", AND_hash(p1)
print "x =", AND_hash(p2)
print "y =", OR_hash(p1)
print "z =", OR_hash(p2)









    



w = 0.216
x = 0.064
y = 0.84
z = 0.64

Question 8

There are 8 strings that represent sets: s1 = abcef; s2 = acdeg; s3 = bcdefg; s4 = adfg; s5 = bcdfgh; s6 = bceg; s7 = cdfg; s8 = abcd. The upper limit on Jaccard distance is 0.2, and we index the strings based on symbols appearing in the prefix. For each of s1, s3, and s6, determine how many other strings that string will be compared with, if it is used as the probe string.



In [13]:

    
# The 8 strings that represent sets:
s1 = "abcef"
s2 = "acdeg"
s3 = "bcdefg"
s4 = "adfg"
s5 = "bcdfgh"
s6 = "bceg"
s7 = "cdfg"
s8 = "abcd"



In [14]:

    
# Upper limit on Jaccard distance is 0.2
# We index a string of length L on the symbols appearing in its prefix of length floor(0.2L+1)
# Strings of length 5 and 6 are indexed on their first 2 symbols
# Strings of length 4 are indexed on their first symbol
Jaccard = 0.2

def prefxlen(str, Jaccard):
    return int(np.floor(len(str)*Jaccard + 1))

s = [s1, s2, s3, s4, s5, s6, s7, s8]
length = map(lambda s: prefxlen(s, Jaccard), s)
print "Prefix lenth of s1 to s8:", length









    



Prefix lenth of s1 to s8: [2, 2, 2, 1, 2, 1, 1, 1]



In [15]:

    
# We want to find the strings in index a, b and c
def index(str):
    indx = {}
    for s in str:
        for l in range(prefxlen(s, Jaccard)):
            if s[l] in indx:
                indx[s[l]].append(s)
            else:
                indx[s[l]] = [s]   
    return indx
    
indx = index(s)
print "strings in a, b and c:", indx









    



strings in a, b and c: {'a': ['abcef', 'acdeg', 'adfg', 'abcd'], 'c': ['acdeg', 'bcdefg', 'bcdfgh', 'cdfg'], 'b': ['abcef', 'bcdefg', 'bcdfgh', 'bceg']}



In [16]:

    
# Number of strings compared with each of s1, s3 and s6 when it is used as the probe string
def count(indx, s):
    counts = set(sum([indx[s[l]] for l in range(prefxlen(s, Jaccard))], []))
    counts.remove(s)
    return (s, len(counts), counts)

print "String, number of strings for comparisons, set of strings for comparisons:" 
for s in [s1, s3, s6]:
    print count(indx, s)









    



String, number of strings for comparisons, set of strings for comparisons:
('abcef', 6, set(['abcd', 'bcdfgh', 'acdeg', 'adfg', 'bceg', 'bcdefg']))
('bcdefg', 5, set(['cdfg', 'abcef', 'bcdfgh', 'acdeg', 'bceg']))
('bceg', 3, set(['bcdfgh', 'abcef', 'bcdefg']))



In [ ]: